Introduction to Web Scraping¶



ONS / NISR
2021

What is web scraping¶

Web scraping (Screen Scraping, Web Data Extraction, Web Harvesting etc.) is a technique used to automatically extract large amounts of data from websites and save it to a file or database.

The Internet is a store of much of the world's information, be it text, media or data in any other format. Every web page displays data in one form or another, and access to this data is crucial for the success of many organisations. Unfortunately, most of this data is not open: most websites do not provide an option to save the data they display to your local storage, or to your own website.

Why do web scraping¶

Web scraping is, at its heart, a way of getting data. Data collection and analysis matter not only to businesses but also to government, non-profit and educational institutions.

The following are a few of the many common applications of Web Scraping:

  1. In eCommerce, Web Scraping is used for competitor price monitoring.

  2. In Marketing, Web Scraping is used for lead generation, to build phone and email lists for cold outreach.

  3. In Real Estate, Web Scraping is used to get property and agent/owner details.

  4. Web Scraping is used to collect training and testing data for Machine Learning projects.

Is Web Scraping Legal?¶

One of the first questions that comes up once you decide to scrape data is whether web scraping is legal. Broadly, scraping data that is already publicly available is legal, as long as you use the data ethically.

Additional considerations¶

Whilst the process of web scraping is legal, consideration should be given to the data that you're attempting to collect. Whilst it may be in the public domain, you may not have a legal standing to collect personal or copyrighted data.

Personal Data - As a rule of thumb, you should have a lawful basis for obtaining, storing and using personal data, particularly when you do not have the user's consent.

Copyrighted Data - Scraping copyrighted data is generally not illegal in itself, but reusing or republishing it without permission may be.
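Alongside the legal questions, a practical courtesy is to check a site's robots.txt file, which states which paths the site owner does and does not want automated tools to access. Python's standard library can parse it; below is a minimal sketch that parses a sample robots.txt inline so it runs offline (the example.com URLs and paths are placeholders, not a real site's rules).

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# In practice you would point at the live file:
#   rp.set_url('https://example.com/robots.txt'); rp.read()
# Here we parse a sample robots.txt inline to keep the example offline.
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

# can_fetch(user_agent, url) tells us whether scraping a URL is permitted.
print(rp.can_fetch('*', 'https://example.com/data.html'))  # True
print(rp.can_fetch('*', 'https://example.com/private/x'))  # False
```

Respecting robots.txt is not a legal test, but it is a widely accepted ethical baseline for scrapers.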

What are web pages?¶

HTML (Hypertext Markup Language)¶

The backbone of any web page is HTML. This is a relatively simple markup language that uses <tags>, denoted by angle brackets, to mark up different elements.

Open https://www.statistics.gov.rw in any web browser, right-click on the page and select View Source.

Creating a Basic HTML page¶

As HTML is just a series of <tags> written in plain text, we can create a web page that can be rendered in any browser just using a text editor.

Create a new file called my_webpage.html and add the following text.

<html> <!-- Open the HTML tag to declare that everything inside is HTML -->
    <body> <!-- Open the body tag, this is where we can write visible elements -->
        <h1>Page title</h1> <!-- h1 stands for Heading, see the use of </> to close the tag -->
        <p>This is my webpage.</p> <!-- p stands for paragraph -->
    </body> <!-- Close the body tag -->
</html> <!-- Close the HTML tag-->

There are plenty of other <tags> we can use in HTML, a full list can be found here

Some common ones you'll see are listed below

Tag Usage
<div> Used to group elements together, or to provide structure to the web page
<span> Used to group elements and provide structure; behaves slightly differently to <div>
<img> Adds an image to the web page
<table>, <th>, <tr>, <td> Defines a table in HTML with the sub-elements defining the table header, table row and table cell respectively.
<a> Create a hyperlink around a specific element
<b>, <i> Create bold and italic elements respectively
<ol>, <ul>, <li> Create ordered and unordered lists where <li> tags list items.

Let's create a second web page called my_complex_webpage.html that incorporates some of these other HTML elements.

<html>
    <body>
        <h1>My Complex Webpage</h1>
        <p>This is my more complex webpage with additional elements</p>
        <a href="https://www.statistics.gov.rw">This is a link to https://www.statistics.gov.rw</a>
        <p>Below here is the NISR logo</p>
        <img src="https://www.statistics.gov.rw/sites/default/files/images/logo.png">
        <h2>This is an unordered list of fruits</h2>
        <ul>
            <li>Apple</li>
            <li>Banana</li>
        </ul>
        <h2>This is a HTML table</h2>
        <table>
            <tr><th>Column 1</th><th>Column2</th><th>Column3</th></tr>
            <tr><td>1</td><td>2</td><td>3</td></tr>
            <tr><td>4</td><td>5</td><td>6</td></tr>
            <tr><td>7</td><td>8</td><td>9</td></tr>
        </table>
    </body>
</html>

Cascading Style Sheets (CSS)¶

HTML is good for structure, but it isn't very useful for styling elements on a web page. That's where Cascading Style Sheets (CSS) comes in. CSS is a separate language that allows us to apply "styles" to elements on our HTML web page.

For example, if we wanted to set the background of our web page to black and the font colour to white, we could use the following CSS code.

/* The body selector tells the browser to apply the contained styles only to the <body> element */

body {  
    background: black; /* Set the page background to black */
    color: white; /* Set the page font colour to white */
}

Save the above code as style.css

There are two ways to add CSS to our web page. We can add it directly into the HTML document using the <style> tags. More commonly you'll see CSS stored in a separate .css file which is linked in the .html file using the <head> and <link> tags.

The <head> tag is like the <body> tag, but is used to store additional meta information that isn't directly displayed on the page.

<html>
    <head>
        <link rel='stylesheet' href='style.css'>
    </head>
    <body>
        ...
    </body>
</html>

Create a copy of my_complex_webpage.html and add the <head> and <link> tags as described above.

CSS can define styles not just for types of elements (e.g. <body>, <li>, <p>); it can also define classes that can be applied to any number of elements.

/* The "." at the start of the definition tells the browser to apply this style */
/* to any elements that have the specified class name. */

.red_text {
    color: red;
}

Add this style to your style.css file.

We can now use the class attribute on any HTML element to assign this style to specific elements.

<html>
    <head>
        <link rel='stylesheet' href='style.css'>
    </head>
    <body>
        <h1>My Complex Webpage</h1>
        <p class="red_text">This is my more complex webpage with additional elements</p>
        ...
        <h2 class="red_text">This is an unordered list of fruits</h2>
        <ul>
            <li class="red_text">Apple</li>
            <li>Banana</li>
        </ul>
        ...
    </body>
</html>

Edit your copy of my_complex_webpage.html to include the class attribute on some tags.

Congrats, you're officially a web designer!¶

Scraping web pages with Pandas¶

Pandas has a built-in function called read_html that allows us to read HTML tables directly from a web page. We can try this with the web page we just finished creating by using the following code.

import pandas as pd 
df = pd.read_html('./my_complex_webpage.html')
df
   Column 1  Column2  Column3
0         1        2        3
1         4        5        6
2         7        8        9

Pandas correctly found our table, parsing out all of our other HTML. Note that by default read_html returns a list of all the tables pandas can find on the web page, even if there is only one.

Pandas also filters out any CSS that has been applied to our tables, returning only the data.

import pandas as pd

# Select the first / only dataframe in the list
df_no_css = pd.read_html('./my_complex_webpage.html')[0]
df_css = pd.read_html('./my_complex_webpage_with_css.html')[0]

# This will error if the dataframes aren't identical.
pd.testing.assert_frame_equal(df_no_css, df_css)

Real world application¶

Real-world websites are, of course, much messier than our example page, so we will also need to employ some basic data-cleaning techniques to deal with them.

Let's look at the Wikipedia page for the Rwandan Men's National Basketball Team. There are lots of different tables, in different styles, some with images, some with complex headers. We can throw the URL directly into read_html and see what comes out.

import pandas as pd 

basketball_tables = pd.read_html('https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team')

print(f'Tables found: {len(basketball_tables)}')
Tables found: 13

Often web developers use <table> tags as a structural element, rather than to explicitly display data. Note that index 0 in basketball_tables doesn't refer to the first visible table, but to the infobox card at the top of the page.

Looking through the 13 parsed tables, we can find the current roster table at position 4, but as Wikipedia can change we want to write code that always selects the roster table. We can do that using the keyword argument match: read_html will then return only the tables containing the string passed.

Once we've done that we can add our usual skiprows and header arguments to make sure the correct row is being used as the header of the table.

url = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'

roster = pd.read_html(url, 
                      match="Rwanda men's national basketball team roster", 
                      skiprows=1, 
                      header=2)[0]
roster.head()
Pos. No. Name Age – Date of birth Height Club Ctr.
0 PG 4 Jean Nshobozwabyose 23 – 1.83 m (6 ft 0 in) Patriots NaN
1 G 5 Ntore Habimana 24 – 1.96 m (6 ft 5 in) Wilfrid Laurier Golden Hawks NaN
2 SG 6 Steven Hagumintwari 27 – 1.93 m (6 ft 4 in) Patriots NaN
3 SG 7 Armel Sangwe 24 – 1.90 m (6 ft 3 in) Espoir NaN
4 SG 8 Emile Kazeneza 20 – 2.01 m (6 ft 7 in) William Carey University NaN

We now have code that can scrape that table whenever we want. However, something looks a little wrong with the Age – Date of birth column: not all of the data has been scraped, notably the actual dates of birth.

This is because there is hidden data within these cells. By default pandas only scrapes the displayed content, skipping hidden elements, unless we explicitly tell it not to using displayed_only=False.

url = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'

roster = pd.read_html(url, 
                      match="Rwanda men's national basketball team roster", 
                      skiprows=1, 
                      header=2,
                      displayed_only=False)[0]
roster.head()
Pos. No. Name Age – Date of birth Height Club Ctr.
0 PG 4 Jean Nshobozwabyose 23 – (1998-06-26)26 June 1998 1.83 m (6 ft 0 in) Patriots NaN
1 G 5 Ntore Habimana 24 – (1997-08-15)15 August 1997 1.96 m (6 ft 5 in) Wilfrid Laurier Golden Hawks NaN
2 SG 6 Steven Hagumintwari 27 – (1993-10-01)1 October 1993 1.93 m (6 ft 4 in) Patriots NaN
3 SG 7 Armel Sangwe 24 – (1997-04-15)15 April 1997 1.90 m (6 ft 3 in) Espoir NaN
4 SG 8 Emile Kazeneza 20 – (2000-08-30)30 August 2000 2.01 m (6 ft 7 in) William Carey University NaN

There we go, now we've got all the data we want from the table. Unfortunately, as Wikipedia has used images rather than text to represent the countries of the players, we're unable to scrape them using pandas.

We'll look at other methods to get this data later.

Limitations of Pandas for web scraping¶

We've seen some of the limitations already, notably pandas not being able to parse images, and it collecting tables that aren't relevant to our goal. More importantly, on most web pages the data we want to scrape won't be formatted into a nice table for us. If it isn't in <table> tags, we won't be able to scrape it using pandas.

  • Good for websites with predefined tables
  • Won't collect information that isn't text

There are plenty of other methods for accessing that data, but first we need to understand a little about how websites work.

How do web sites work?¶

Now that we understand the structure of a web page, we can see how it might be extremely tedious to create every individual web page, especially if we want to include regularly changing data.

That's why most web pages are created dynamically. This means that the web page is put together on-the-fly whenever someone requests to see it.

Client-side versus Server-side scripting¶

Web pages are usually generated in one of two ways: via client-side scripting or via server-side scripting. This determines where the data gets turned into HTML elements. If it happens on the client side, the raw data is sent directly to our browser and our computer builds the web page; if it happens on the server side, we never see the raw data, only the computed HTML elements.

Client-side scripting:
  • Data is usually processed with JavaScript
  • It is possible to see the underlying data

Server-side scripting:
  • Data can be processed with PHP, JavaScript, Python etc.
  • It is not possible to see the underlying data

Inspecting a web page's creation¶

We've already looked at a web page's source by using View page source. There is a more advanced tool for working with web pages built into most browsers, usually called Inspect (Right-Click > Inspect). Let's inspect the Wikipedia page for the Rwandan Men's National Basketball Team.

We'll come back to the Elements tab later; for now we want to look at the Network tab on the toolbar.

The network tab records all the requests that go between our browser and the server (as well as other servers) in the production of the web page. When you first open the page it will be blank. Refreshing the browser page will cause the network tab to record all the different requests that occur.

Clicking on any one of the requested files, you can see the full HTTP request (more on this later) as well as a preview and the full response from the server for that request. Looking at the response for the first request (the page itself) we can see that the data was included directly in the page as HTML. This implies that this particular page was processed server-side. Another clue is the reference to PHP, which is an exclusively server-side language.

Client Side Example¶

Let's look at the NBA website instead. This page shows us statistics for the regular season for players in the NBA, ordered by number of points.

We could try to scrape this data using pandas, but let's first see if we can find the source of the data. Opening up the Inspect tool, we can look at the Network tab to find where this data is loaded from.

There are a lot of files loaded as part of this web page. We can reduce the number we need to search through by using the built-in filters on the Network tab. Let's look at Fetch/XHR, which filters the list down to the requests usually associated with data.

Looking through this shorter list of files, one stands out as potentially containing the data that we want to extract from the web page.

We can click on the file and then the Response tab to see what information is sent to our browser. Looking at the response, we can see that the data that goes into our table is not encoded as HTML, so we can be relatively sure that this web page is generated at least partly on the client side.

Key takeaways¶

  • If a website is processed client-side then it may be possible to get at the data that creates the web page without having to parse HTML
  • However, some web scraping programs won't be able to execute the client-side code, meaning we have to use a web browser.
  • If a website is processed server-side then it is not possible to get the data without parsing the HTML served.

Requests Library¶

The requests library is the de-facto standard for making HTTP requests in Python; it abstracts away much of the complexity we just saw using the Inspect tool. requests is not part of the standard library, but it comes preinstalled with most Python distributions (e.g. Anaconda); if you don't have it, run pip install requests.

The requests library is very powerful, but importantly we can use it to do in Python what our web browser was doing when it loaded in our data.

Returning to our NBA example, the Network tab shows us all of the HTTP requests that were made in the process of creating the web page that we see.

If we look in the Headers tab, we can see the form that this HTTP request took.

The URL has the request information encoded into it; we can also see that the request method is GET.

Let's see what happens if we recreate that request in Python using the requests library. First we need to copy the request URL from the Headers tab, again noting that the method is GET.

import requests

url = 'https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2021-22&SeasonType=Regular+Season&StatCategory=PTS'

# We are using the .get method to match the GET HTTP request 
# we also include the .json() method to return to us the response 
# from the request as a python dictionary.
response = requests.get(url).json()

print(response)

We can see that the result of that command is the same as the data we saw using the Inspect tool. We can look through this nested dictionary object to try to understand the structure of the response. It is important to note that not every response will look the same; you'll need to dig into each response to work out how to extract the data.

We can look through the object and see if there is a way to convert information into a table that we can use.

print(response.keys())
print(response['resultSet'].keys())
dict_keys(['resource', 'parameters', 'resultSet'])
dict_keys(['name', 'headers', 'rowSet'])

Looking at the keys in the data, we can see that the response contains three objects called resource, parameters and resultSet. resource and parameters are metadata about the table we've just requested. resultSet contains another dictionary with the keys name, headers and rowSet. rowSet is a list of lists, each representing a row of data, and headers contains the list of column headers.

We can put these together using pandas into a dataframe very easily.

import requests
import pandas as pd 

url = 'https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2021-22&SeasonType=Regular+Season&StatCategory=PTS'

response = requests.get(url).json()
table_headers = response['resultSet']['headers']
table_data = response['resultSet']['rowSet']

df = pd.DataFrame(table_data, columns=table_headers)
df
PLAYER_ID RANK PLAYER TEAM GP MIN FGM FGA FG_PCT FG3M ... FT_PCT OREB DREB REB AST STL BLK TOV PTS EFF
0 201142 1 Kevin Durant BKN 12 34.4 11.2 19.1 0.585 1.9 ... 0.829 0.5 8.0 8.5 5.0 0.6 0.7 3.5 29.5 31.8
1 201939 2 Stephen Curry GSW 11 33.6 8.6 19.9 0.434 5.0 ... 0.949 0.8 5.6 6.5 6.5 1.6 0.6 3.1 27.4 28.0
2 202331 3 Paul George LAC 11 35.3 10.0 21.9 0.456 3.2 ... 0.867 0.5 7.3 7.8 5.4 2.5 0.5 4.5 26.7 26.0
3 203507 4 Giannis Antetokounmpo MIL 12 32.9 9.5 19.2 0.496 1.3 ... 0.688 1.9 9.9 11.8 6.0 1.1 1.8 3.0 26.6 31.8
4 1629630 5 Ja Morant MEM 11 35.3 10.0 20.6 0.485 1.7 ... 0.779 1.3 4.5 5.7 7.3 1.7 0.3 4.0 26.5 25.5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
274 1629216 275 Gabe Vincent MIA 10 8.9 0.8 2.0 0.400 0.1 ... 1.000 0.3 0.5 0.8 1.7 0.2 0.0 0.6 1.9 2.8
275 203085 276 Austin Rivers DEN 9 12.4 0.8 3.0 0.259 0.2 ... 0.500 0.4 0.7 1.1 0.8 0.3 0.1 0.7 1.9 1.2
276 1630541 276 Moses Moody GSW 9 6.8 0.8 2.0 0.389 0.2 ... 0.500 0.0 0.9 0.9 0.3 0.0 0.1 0.1 1.9 1.8
277 1626161 278 Willie Cauley-Stein DAL 11 10.0 0.7 1.7 0.421 0.0 ... 0.000 0.7 1.6 2.4 0.5 0.3 0.0 0.2 1.5 3.4
278 1630215 279 Jared Butler UTA 10 4.6 0.5 1.9 0.263 0.2 ... 0.500 0.0 0.6 0.6 0.6 0.0 0.5 0.6 1.4 0.9

279 rows × 24 columns

If we look closer at the URL, we can see it encodes a lot of arguments; these arguments look very similar to the filters available on the web page.

https://stats.nba.com/stats/leagueLeaders?
    LeagueID=00&
    PerMode=PerGame&
    Scope=S&
    Season=2021-22&
    SeasonType=Regular+Season&
    StatCategory=PTS

If we change "PerGame" to "Totals" and re-run our code, we get the data that would have populated the website's table had we selected that option. What we've done here is discover the API that sits behind the NBA website, and we can exploit it to extract data.

PLAYER_ID RANK PLAYER TEAM GP MIN FGM FGA FG_PCT FG3M ... REB AST STL BLK TOV PF PTS EFF AST_TOV STL_TOV
0 201142 1 Kevin Durant BKN 12 413 134 229 0.585 23 ... 102 60 7 8 42 16 354 381 1.43 0.17
1 203507 2 Giannis Antetokounmpo MIL 12 395 114 230 0.496 16 ... 142 72 13 21 36 36 319 381 2.00 0.36
2 201939 3 Stephen Curry GSW 11 370 95 219 0.434 55 ... 71 72 18 7 34 17 301 308 2.12 0.53
3 202331 4 Paul George LAC 11 388 110 241 0.456 35 ... 86 59 28 5 49 31 294 286 1.20 0.57
4 1629630 5 Ja Morant MEM 11 388 110 227 0.485 19 ... 63 80 19 3 44 16 292 281 1.82 0.43
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
443 1630536 418 Sharife Cooper ATL 2 7 0 3 0.000 0 ... 0 2 0 0 1 0 0 -2 2.00 0.00
444 1629605 418 Tacko Fall CLE 3 3 0 1 0.000 0 ... 2 0 0 0 1 0 0 0 0.00 0.00
445 1628962 418 Udoka Azubuike UTA 2 2 0 0 0.000 0 ... 0 0 0 0 0 0 0 0 0.00 0.00
446 1630176 418 Vernon Carey Jr. CHA 1 1 0 1 0.000 0 ... 1 0 0 0 0 0 0 0 0.00 0.00
447 1627782 418 Wayne Selden NYK 1 1 0 0 0.000 0 ... 0 0 0 0 0 0 0 0 0.00 0.00

448 rows × 27 columns
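Rather than editing the URL string by hand, we can let requests encode the arguments for us from a dictionary. Below is a sketch using the parameter names taken from the URL above; it prepares the request without sending it so you can see the generated URL, but passing the same dictionary as params=... to requests.get works identically.

```python
import requests

base_url = 'https://stats.nba.com/stats/leagueLeaders'
params = {
    'LeagueID': '00',
    'PerMode': 'Totals',  # changed from 'PerGame'
    'Scope': 'S',
    'Season': '2021-22',
    'SeasonType': 'Regular Season',
    'StatCategory': 'PTS',
}

# Prepare the request (without sending it) to see the URL requests would build.
prepared = requests.Request('GET', base_url, params=params).prepare()
print(prepared.url)
```

This keeps the query arguments readable and makes it easy to change one filter at a time.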

Beautiful Soup Library¶

Sometimes, in fact most of the time, the information we want to scrape won't be neatly formatted into a table. We need to be able to extract the relevant information programmatically from non-table elements. Enter beautifulsoup: an HTML parsing library for Python that allows us to pull all the relevant information out of a web page using a nice, easy-to-use syntax.

beautifulsoup does not come as part of the standard Python installation, so we need to pip install it. We can do this inside a Jupyter notebook using

!pip install beautifulsoup4

Or just on the command line by running the same command, without the ! at the beginning of the line.

Requirement already satisfied: beautifulsoup4 in /opt/miniconda3/lib/python3.9/site-packages (4.10.0)
Requirement already satisfied: soupsieve>1.2 in /opt/miniconda3/lib/python3.9/site-packages (from beautifulsoup4) (2.3)

Once we've installed beautifulsoup we can start to use it to parse our HTML data. Let's start by parsing the web page that we made earlier.

from bs4 import BeautifulSoup 

with open('./my_complex_webpage.html', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

print(soup)
<html>
<head>
<link href="style.css" rel="stylesheet"/>
</head>
<body>
<h1>My Complex Webpage</h1>
<p class="red_text">This is my more complex webpage with additional elements</p>
<a href="https://www.statistics.gov.rw">This is a link to https://www.statistics.gov.rw</a>
<p>Below here is the NISR lo
...

beautifulsoup has lots of functions that make it very easy to extract information from an HTML page. The most useful is the find_all() method. Full documentation for the find_all method can be found here.
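As a quick aside, find() returns only the first matching tag, while find_all() returns every match. A minimal sketch with a tiny inline document:

```python
from bs4 import BeautifulSoup

# A small inline document, so the example is self-contained.
soup = BeautifulSoup('<ul><li>Apple</li><li>Banana</li></ul>', 'html.parser')

print(soup.find('li'))           # only the first match: <li>Apple</li>
print(len(soup.find_all('li')))  # 2
```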

Earlier we used pandas to extract the HTML table very easily, but what if we're more interested in the unordered list of fruits? We can use the find_all function to retrieve all of the list item <li> tags.

soup.find_all('li')
[<li class="red_text">Apple</li>, <li>Banana</li>]

We've successfully extracted all of the <li> tags, but our data still isn't very clean. We're not interested in the HTML tags themselves, just the data contained within. We can deal with this by using beautifulsoup to strip out the HTML tags.

# We can do this with a loop
for tag in soup.find_all('li'):
    print(tag.get_text())

# Or by using a list comprehension
[tag.get_text() for tag in soup.find_all('li')]
Apple
Banana
['Apple', 'Banana']

Success! However, it is common that the information we want to extract doesn't all share the same <tag>, or that lots of irrelevant information does. Fortunately, when people design web pages they tend to give similar information the same visual appearance. We know that visual appearance is controlled by CSS, and with beautifulsoup we can extract data by CSS class!

for red_text in soup.find_all(class_="red_text"):
    print(red_text)

[red_text.get_text() for red_text in soup.find_all(class_='red_text')]
<p class="red_text">This is my more complex webpage with additional elements</p>
<h2 class="red_text">This is an unordered list of fruits</h2>
<li class="red_text">Apple</li>
['This is my more complex webpage with additional elements',
 'This is an unordered list of fruits',
 'Apple']
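beautifulsoup also supports CSS selector syntax through the select() method, which some people find more natural if they already know CSS. A sketch using a small inline document (the red_text class mirrors the one from our stylesheet):

```python
from bs4 import BeautifulSoup

html = '''
<p class="red_text">First</p>
<p>Second</p>
<li class="red_text">Apple</li>
'''
soup = BeautifulSoup(html, 'html.parser')

# '.red_text' selects any element with that class; 'li.red_text' only <li> tags.
print([tag.get_text() for tag in soup.select('.red_text')])
print([tag.get_text() for tag in soup.select('li.red_text')])
```

select() accepts most of the selector syntax you would write in a .css file, so you can often copy a selector straight from the browser's Inspect tool.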

Real world example¶

Let's go back to our Wikipedia example. We were able to use pandas to scrape the table, but we couldn't get the country information because it wasn't stored as plain text. We can use beautifulsoup to parse out that information with much finer control.

First we need to get the HTML that generates that Wikipedia page. We can do this using our trusty requests library.

import requests

URL = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'
wiki_page = requests.get(URL)
print(wiki_page)
<Response [200]>

Oh, this is just a response code, not the HTML we were expecting. Fortunately, Response [200] means that the request executed successfully. In order to get the HTML we need to use the .text attribute.

import requests

URL = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'
wiki_page = requests.get(URL).text
print(wiki_page)
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Rwanda men's national basketball team - Wikipedia</title>
<script>document.documentElement.classNam
...
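Before relying on .text, it's good practice to check that the request actually succeeded: status codes in the 4xx/5xx ranges indicate errors, and raise_for_status() turns them into exceptions. A sketch using a hand-built Response object so it runs offline (a real script would check the object returned by requests.get; the URL here is a placeholder):

```python
import requests

# Build a Response by hand to simulate a failed request offline.
resp = requests.Response()
resp.status_code = 404
resp.reason = 'Not Found'
resp.url = 'https://example.com/missing'

print(resp.ok)  # False -- .ok is True only for status codes below 400
try:
    resp.raise_for_status()
except requests.HTTPError as err:
    print('Request failed:', err)
```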

Now we can parse this HTML with beautifulsoup. As we're only interested in the roster table, we can filter out all of the HTML that isn't related to it.

from bs4 import BeautifulSoup

soup = BeautifulSoup(wiki_page, 'html.parser')

tables = soup.find_all('table')
print(f'Found {len(tables)} tables.\n')

# Filter the list of tables to just those that contain a country 
# column called Ctr.
country_tables = [tbl for tbl in tables if 'Ctr.' in str(tbl)]

# This is more complex HTML than usual: there is a table inside a table, so we
# select the second country_table, which represents the inner table.
roster_html = country_tables[1]
print(roster_html)
Found 13 tables.

<table class="sortable" style="background:transparent; margin:0px; width:100%;">
<tbody><tr>
<th><abbr title="Position(s)">Pos.</abbr></th>
<th><abbr title="Number">No.</abbr></th>
<th>Name</th>
<th>Age – <small>Date of birth</small></th>
<th>Height</th>
<th>Club</th>
<th><abbr title="Country">Ctr.<
...

Try parsing the roster_html BeautifulSoup object into a pandas dataframe.

# Use a list comprehension to look for all the <th> tags, for each 
# one, get the text and strip the result. These are the column headers
# for the table.
header = [col.get_text().strip() for col in roster_html.find_all('th')]

# Create an empty list to store our processed rows.
rows = []
# Loop over all of the <tr> tags; each one corresponds to a row
# in our table. 
for tr in roster_html.find_all('tr')[1:]:
    # Create an empty row variable where we can store all of our processed
    # data
    row = []
    # Loop over all of the <td> tags inside the current <tr> tag. These are 
    # going to be our data items.
    for data in tr.find_all('td'):

        # If the data item isn't blank (or just a new line character)
        # then add it to our row, stripping out the excess whitespace
        if data.get_text() != '\n':
            row.append(data.get_text().strip())

        # If there is an <img> tag in the <td> tag then we're on our 
        # flag column. We want to extract the country information. 
        # We could extract this from the image, but all the images are 
        # wrapped in a <a> hyperlink tag to that country, which will be
        # easier to clean. 
        if data.find('img') is not None:
            # Get the <a> hyperlink tag
            img = data.find('a')
            # Add the href attribute (this is the link address) to our row
            row.append(img['href'])

    # Finally add the row into our list of rows.
    rows.append(row)

# Construct a dataframe from our list of rows and our header data
df = pd.DataFrame(rows, columns=header)
df
Pos. No. Name Age – Date of birth Height Club Ctr.
0 PG 4 Jean Nshobozwabyose 23 – (1998-06-26)26 June 1998 1.83 m (6 ft 0 in) Patriots /wiki/Rwanda
1 G 5 Ntore Habimana 24 – (1997-08-15)15 August 1997 1.96 m (6 ft 5 in) Wilfrid Laurier Golden Hawks /wiki/Canada
2 SG 6 Steven Hagumintwari 27 – (1993-10-01)1 October 1993 1.93 m (6 ft 4 in) Patriots /wiki/Rwanda
3 SG 7 Armel Sangwe 24 – (1997-04-15)15 April 1997 1.90 m (6 ft 3 in) Espoir /wiki/Rwanda
4 SG 8 Emile Kazeneza 20 – (2000-08-30)30 August 2000 2.01 m (6 ft 7 in) William Carey University /wiki/United_States
5 SG 9 Dieudonné Ndizeye 24 – (1996-10-14)14 October 1996 1.98 m (6 ft 6 in) Patriots /wiki/Rwanda
6 PF 10 Olivier Shyaka 26 – (1995-08-14)14 August 1995 2.00 m (6 ft 7 in) REG /wiki/Rwanda
7 F 11 Alex Mpoyo 24 – (1997-01-05)5 January 1997 2.01 m (6 ft 7 in) Trepça /wiki/Kosovo
8 SG 12 Kenny Gasana 36 – (1984-11-09)9 November 1984 1.90 m (6 ft 3 in) Patriots /wiki/Rwanda
9 C 13 Elie Kaje 26 – (1995-03-17)17 March 1995 1.90 m (6 ft 3 in) Patriots /wiki/Rwanda
10 C 16 Prince Ibeh 27 – (1994-06-03)3 June 1994 2.06 m (6 ft 9 in) Patriots /wiki/Rwanda
11 SF 17 William Robeyns 25 – (1996-02-23)23 February 1996 1.91 m (6 ft 3 in) Phoenix Brussels /wiki/Belgium
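The Ctr. column now holds Wikipedia paths like /wiki/Rwanda rather than country names. A little pandas string manipulation can clean these up; here is a sketch on a small hand-made dataframe mirroring the scraped one:

```python
import pandas as pd

# A small sample mirroring the scraped roster's Ctr. column.
df = pd.DataFrame({'Name': ['Jean Nshobozwabyose', 'Emile Kazeneza'],
                   'Ctr.': ['/wiki/Rwanda', '/wiki/United_States']})

# Strip the '/wiki/' prefix and replace underscores with spaces
# (e.g. '/wiki/United_States' -> 'United States').
df['Ctr.'] = (df['Ctr.']
              .str.replace('/wiki/', '', regex=False)
              .str.replace('_', ' ', regex=False))
print(df)
```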

Non-tabular data¶

Let's look at some less tabular data: the fiba.basketball news page. We can see there is a list of news articles, each with a headline, a date and a small blurb. To start, let's inspect one of the items and look for something common we can latch on to.

All the additional news articles are in <div> tags with the class related_row.

<div class="related_row">
<a href="http://www.fiba.basketball/afrobasket/2021/qualifiers/news/enabu-rwahwire-reflect-on-uganda-tenacity-ahead-of-morocco-cracker">
<div class="related_top right">
<div class="date_highlighted">07/07/2021</div>
<div class="category" style="background-color: #000000;">News</div>
<h6>Enabu, Rwahwire reflect on Uganda tenacity ahead of Morocco cracker</h6>
</div>
<div class="related_image left adaptive_image" data-adaptive-image-breakpoints="{ default: '/images.fiba.com/Graphic/2/F/7/8/8iqS79S7ukebtkZoyPWr0Q.jpg?v=20210113123220303', 480: '/images.fiba.com/Graphic/F/3/0/5/b8Zt49T0X0CHF0MVS2z46Q.jpg?v=20210113123217152' }" data-adaptive-image-extra-attrs="{ alt: '5 Jimmy Enabu (UGA)' }">
</div>
<div class="related_bottom right">
<p>SALE (Morocco) - Qualifying for the FIBA AfroBasket is a lifetime dream for many and for Uganda captain Jimmy Enabu, it is a befitting reward to a diligent servant of the game back home. </p>
</div>
</a>
</div>

We can see that all the information we want is stored inside this <div> tag with the class related_row. There is a <div> inside with the class date_highlighted that contains the date, and one with the class category that contains the article's category. The title of the article is wrapped in <h6> header tags, and the blurb is the only <p> tag within the <div>.

Using all this we can write a very simple loop to go through all of the related_row objects and pull out the pertinent information using the exact same methods we've already used.

Try parsing the FIBA news page into a pandas dataframe

date headline category blurb
0 07/07/2021 Enabu, Rwahwire reflect on Uganda tenacity ahe... News SALE (Morocco) - Qualifying for the FIBA AfroB...
1 11/03/2021 Madagascar's Botou wants to keep the AfroBaske... News ANTANANARIVO (Madagascar) – Despite having mis...
2 10/03/2021 10 standout performers from the last window of... News ABIDJAN - As we look back at the Second Round ...
3 08/03/2021 Guinea celebrate second straight AfroBasket ap... News Guinea left Cameroon at the end of the FIBA Af...
4 02/03/2021 Decisions concerning the February window of th... News FIBA has taken decisions regarding Equatorial ...
5 26/02/2021 Impressive operational efforts in FIBA Contine... News MIES (Switzerland) - Another successful window...
6 24/02/2021 "Senegal have some room for improvement," says... News DAKAR (Senegal) - Senegal finished top of Grou...
7 23/02/2021 History Makers Kenya's confidence is sky-high News By beating eleven-time Africa champions Angola...
8 21/02/2021 Four teams undefeated at the end of AfroBasket... Review MONASTIR/YAOUNDE (Tunisia/Cameroon) - The 20-t...
9 21/02/2021 Top performers at Day 3 in Yaounde News YAOUNDE (Cameroon) - There was great frenzy at...
10 21/02/2021 Top performers as curtains fall on FIBA AfroBa... News MONASTIR (Tunisia) - With prestige and honor a...
11 21/02/2021 Aristide Mouaha from mop boy to Cameroon inter... News YAOUNDE (Cameroon) - The game was in the third...
12 21/02/2021 Ongwae's buzzer-beater shocks Angola as Kenya ... Game Report YAOUNDE (Cameroon) On what was Kenya's biggest...
13 20/02/2021 Three tickets still available for AfroBasket 2021 Review On the day that Kenya caused the biggest upset...
14 20/02/2021 Romdhane show as Dokossi rises to summit News MONASTIR (Tunisia) - Of newbies and veterans e...

Selenium Library¶

Let's look at one final example piece of HTML. In this one we're going to use some basic JavaScript to add some elements to the page.

<html>
    <head>
    <script type='text/javascript'>
        window.onload = function(){
            for(var i=0; i<5; i++){
                var paragraph = document.createElement('p');
                paragraph.innerHTML = 'This is paragraph '+i;
                document.body.appendChild(paragraph);
            }
        }
    </script>
    </head>
    <body>
    </body>
</html>

If we open this .html file in a web browser we get what we'd expect: five paragraph elements labelled 0 to 4.

But when we try our usual approach of opening this file in BeautifulSoup, we get the following result.

from bs4 import BeautifulSoup

with open('./dynamic_javascript.html', 'r') as f:
    html = f.read()

soup = BeautifulSoup(html, 'html.parser')
soup.find_all('p')
[]

This is because, in order for the <p> tags to appear on the page, something needs to execute the JavaScript that generates them. BeautifulSoup is just an HTML parser; it isn't able to execute the JavaScript stored in the .html file.

This is a difficult problem to deal with. It would be useful if we could access the .html file as it is rendered inside our browser. Enter Selenium.

Selenium is a browser automation tool that is primarily used for testing websites, but it can be put to a whole host of different tasks. What the Selenium library allows us to do is control a web browser using Python and interact with the results.

Installing Selenium¶

Installing Selenium is a little trickier than most Python packages: in addition to the Python library, we need a separate driver program that lets Selenium communicate with our web browser.

First we can pip install the selenium library

!pip install selenium

Then we need to go to https://sites.google.com/chromium.org/driver/ and download the release of ChromeDriver that matches our installed version of Chrome. Next we need to make sure the Selenium library can find the driver; to do this we add it to our system path, the list of directories our computer searches for programs.

  1. Create the directory C:\bin
  2. Extract chromedriver.exe into C:\bin
  3. Run the following command
setx PATH "%PATH%;C:\bin"
  4. Check this worked by restarting the command prompt and running the following
chromedriver -v
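We can also check from Python whether the driver is visible on the path. The standard-library function shutil.which returns the full path to an executable if it can be found in the system path, or None otherwise:

```python
import shutil

# Returns the full path to chromedriver if it is on the system path, or None.
driver_path = shutil.which('chromedriver')
print(driver_path)
```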

Using Selenium¶

Once Selenium is installed we can import and use it just like any other package; its syntax is very similar to the packages we've looked at so far. In order to start a Selenium browser session we have to specify the type of browser that we're planning to use. As we installed the Chrome driver, we can do this with the following code.

from selenium import webdriver

driver = webdriver.Chrome()

You'll notice that running this code opens up a new browser window with the message Chrome is being controlled by automated test software. This is the browser that python is going to control.

Be careful running this cell multiple times: every time webdriver.Chrome() is called it will start a new browser, but it won't close the old one. You can close a session explicitly with driver.quit().

Now we have our web driver running, we can tell it to navigate to various pages using the driver.get() method. For example, if we wanted to open the Rwanda men's basketball Wikipedia page we would use:

driver.get('https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team')

Similarly, if we wanted to open our dynamic JavaScript page, we just need to tell the driver to navigate there. Opening a local file is a little different: we need to use file:// rather than http://, and we also need the full filepath, which we can get from os.getcwd().

import os

local_file = 'file://' + os.getcwd() + '/dynamic_javascript.html'
driver.get(local_file)
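A less error-prone way to build the file:// URL is the standard library's pathlib, whose as_uri method produces a well-formed URL from an absolute path and avoids subtle mistakes like a missing slash between the directory and the filename:

```python
from pathlib import Path

# as_uri() builds a well-formed file:// URL from an absolute path.
local_file = Path.cwd().joinpath('dynamic_javascript.html').as_uri()
print(local_file)
```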

Once we've navigated our browser to the right page there are several methods we can use to extract data from the processed HTML. Nearly all methods have a find_element and a find_elements version, returning the first match and a list of all matching elements respectively.

Driver Method                         Usage
find_element_by_id                    Select an element by its id attribute
find_elements_by_name                 Select elements by the name attribute
find_elements_by_xpath                Select elements by an XML path
find_elements_by_link_text            Select elements with specific hyperlink text
find_elements_by_partial_link_text    Select elements matching part of hyperlink text
find_elements_by_tag_name             Select elements by tag name / tag type
find_elements_by_class_name           Select elements with the same class
find_elements_by_css_selector         Select elements by CSS selectors

We can use the find_elements_by_tag_name method to collect our dynamically generated <p> tags. What we get back is a list of WebElement objects. We can use the .text property to retrieve the text inside the tag, or .get_attribute('outerHTML') to extract the full tag as a string.

import os
from selenium import webdriver 

driver = webdriver.Chrome()
local_file = 'file://' + os.getcwd() + '/dynamic_javascript.html'

driver.get(local_file)
p_tags = driver.find_elements_by_tag_name('p')

for tag in p_tags:
    print(type(tag))
    print(tag.get_attribute('outerHTML'))
    print(tag.text)
<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 0</p>
This is paragraph 0
<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 1</p>
This is paragraph 1
<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 2</p>
This is paragraph 2
<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 3</p>
This is paragraph 3
<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 4</p>
This is paragraph 4

In some cases it may be more useful to use Selenium to generate the page, but then parse the resulting HTML using BeautifulSoup. Fortunately Selenium allows us to access the full HTML of the page including all of the generated elements.

import os
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
local_file = 'file://' + os.getcwd() + '/dynamic_javascript.html'

driver.get(local_file)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

soup.find_all('p')
[<p>This is paragraph 0</p>,
 <p>This is paragraph 1</p>,
 <p>This is paragraph 2</p>,
 <p>This is paragraph 3</p>,
 <p>This is paragraph 4</p>]

Practical Example¶

Let's look at the fiba.basketball news page. If we go to the bottom of the page, we can see a button that says Show More News. This button dynamically loads more news onto the page we're currently viewing.

We wouldn't be able to get this using requests alone, but maybe Selenium can help. First we need to tell Selenium that we want to click that button, but before we can click it we need to find it.

Using the Inspect tool we can see that the button has the class show_more_button, so we can use that with Selenium's class-name selector to isolate the element.

Once we've done that we can use the built-in click method for WebElements to simulate us clicking the Show More News button.

from selenium import webdriver 

driver = webdriver.Chrome()

URL = 'https://www.fiba.basketball/afrobasket/2021/qualifiers/news'
driver.get(URL)

button = driver.find_element_by_class_name('show_more_button')
button.click()

Now if we use the same code we used previously to scrape this information, swapping out the requests call for the Selenium one, we can parse even more news than we did previously.

Note that clicking the button doesn't generate the new news items instantly; it takes a moment for the browser to collect them. We need to add a wait using time.sleep to let the page load before we scrape the data.
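A fixed time.sleep(5) after button.click() is often enough, but a slightly more robust pattern is to poll until the content we expect has actually appeared. This is a minimal sketch of such a helper; the Selenium call in the comment assumes an open driver session and an expected article count, both of which are illustrative:

```python
import time

def wait_for(condition, timeout=10, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value, or raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError('condition was not met within the timeout')

# Hypothetical usage with Selenium (assumes `driver` is an open session):
# wait_for(lambda: len(driver.find_elements_by_class_name('related_row')) > 15)
```

Polling like this returns as soon as the content arrives, rather than always waiting the full fixed delay.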

date headline category blurb
0 07/07/2021 Enabu, Rwahwire reflect on Uganda tenacity ahe... News SALE (Morocco) - Qualifying for the FIBA AfroB...
1 11/03/2021 Madagascar's Botou wants to keep the AfroBaske... News ANTANANARIVO (Madagascar) – Despite having mis...
2 10/03/2021 10 standout performers from the last window of... News ABIDJAN - As we look back at the Second Round ...
3 08/03/2021 Guinea celebrate second straight AfroBasket ap... News Guinea left Cameroon at the end of the FIBA Af...
4 02/03/2021 Decisions concerning the February window of th... News FIBA has taken decisions regarding Equatorial ...
5 26/02/2021 Impressive operational efforts in FIBA Contine... News MIES (Switzerland) - Another successful window...
6 24/02/2021 "Senegal have some room for improvement," says... News DAKAR (Senegal) - Senegal finished top of Grou...
7 23/02/2021 History Makers Kenya's confidence is sky-high News By beating eleven-time Africa champions Angola...
8 21/02/2021 Four teams undefeated at the end of AfroBasket... Review MONASTIR/YAOUNDE (Tunisia/Cameroon) - The 20-t...
9 21/02/2021 Top performers at Day 3 in Yaounde News YAOUNDE (Cameroon) - There was great frenzy at...
10 21/02/2021 Top performers as curtains fall on FIBA AfroBa... News MONASTIR (Tunisia) - With prestige and honor a...
11 21/02/2021 Aristide Mouaha from mop boy to Cameroon inter... News YAOUNDE (Cameroon) - The game was in the third...
12 21/02/2021 Ongwae's buzzer-beater shocks Angola as Kenya ... Game Report YAOUNDE (Cameroon) On what was Kenya's biggest...
13 20/02/2021 Three tickets still available for AfroBasket 2021 Review On the day that Kenya caused the biggest upset...
14 20/02/2021 Romdhane show as Dokossi rises to summit News MONASTIR (Tunisia) - Of newbies and veterans e...
15 20/02/2021 Ongwae, Ndoye, Nzeulie and Obiekwe dazzle in Y... News Magical and electrifying may not come close to...
16 20/02/2021 Liz Mills trailblazing for more female coaches... News YAOUNDE (Cameroon) - A dream nursed in Sydney,...
17 20/02/2021 Iroegbu brothers excited to continue Nigerian ... Long Read MONASTIR (Tunisia) - Playing with your sibling...
18 20/02/2021 Cote d'Ivoire's Konate defying age in style News YAOUNDE (Cameroon) - Cote d'Ivoire's Stephane ...
19 20/02/2021 Kouguere magic as Central African Republic edg... Game Report MONASTIR (Tunisia) - Central African Republic ...
20 19/02/2021 Nshobozwabyosenumukiza, Diogu sign out on a high News MONASTIR (Tunisia) - Rwanda point guard Jean J...
21 19/02/2021 Morais, Diallo, Mansare and Thompson star in Y... News YAOUNDE (Cameroon) - There was fireworks at th...
22 19/02/2021 Rwanda pick first win as four more teams quali... Review MONASTIR/YAOUNDE (Tunisia/Cameroon) - Rwanda p...
23 19/02/2021 Angola's Leonel Paulo bringing his experience ... News YAOUNDE (Cameroon) - When you've played at fiv...
24 19/02/2021 FIBA Statement about the February FIBA AfroBas... Statement Following the Covid-19 Protocol for FIBA Offic...
25 19/02/2021 Luol Deng's South Sudan revel in FIBA AfroBask... News MONASTIR (Tunisia) - It has been a long journe...
26 19/02/2021 Senegal's Ndoye "We want to stay unbeaten" News YAOUNDE (Cameroon) - Whenever five-time Afroba...
27 18/02/2021 Kuany, Doucoure and Omoerah in cloud nine at F... News MONASTIR (Tunisia) - Kuany Ngor Kuany was at t...
28 18/02/2021 South Sudan, Mali qualify for AfroBasket 2021,... Review Day 2 of February's window of the FIBA AfroBas...
29 18/02/2021 Players to watch out for in FIBA AfroBasket 20... News YAOUNDE (Cameroon) - The third and final windo...